Introduction and Motivation
Getting data from the hotel industry has always been a real challenge. Although we can easily read up on hotel jargon, browse the standard descriptive statistics in market research reports, or search the internet for the price of our next summer holiday, less is known about how a hotel actually works behind the scenes. In this article, let us take a sneak peek into hotel data in detail and come up with some interesting findings.
The “hotel booking demand datasets” compiled by Nuno Antonio, Ana de Almeida and Luis Nunes (Antonio, Almeida, and Nunes 2019) are a valuable effort to overcome this challenge. The data were obtained from two hotels in Portugal: one city hotel in Lisbon and one resort hotel in the Algarve. Some sensitive information that could reveal the identity of the two hotels was not provided, but this does not diminish the value of the dataset for education, management, machine learning and many other purposes. Let us analyze it and find out some interesting things about how the hotels work.
Data description
Dataset overview and structure
The datasets for the two hotels can be downloaded separately from ScienceDirect.com, in the paper by Antonio, Almeida, and Nunes (2019). However, Mock (2020) of the TidyTuesday challenge has done us a favor and combined the two datasets into one. The two original datasets and the combined one can be obtained at this GitHub page.
For this study, we used only the combined dataset, which is stored in CSV format with 32 variables and 119,390 observations. Each observation represents one hotel booking.
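A quick way to confirm the size of a CSV file is to count its header columns and data rows. The Python sketch below is only an illustration using the standard library, not the code used in this study (which relied on R); the helper name `csv_shape` and the three-column sample are hypothetical.

```python
import csv
import io


def csv_shape(handle):
    """Return (number of data rows, number of columns) for an open CSV file."""
    reader = csv.reader(handle)
    header = next(reader)             # first line holds the variable names
    n_rows = sum(1 for _ in reader)   # every remaining line is one booking
    return n_rows, len(header)


# Tiny in-memory stand-in for the real file, with only three illustrative columns.
sample = io.StringIO(
    "hotel,is_canceled,lead_time\n"
    "Resort Hotel,0,342\n"
    "City Hotel,1,88\n"
)
print(csv_shape(sample))  # (2, 3)
```

Applied to the combined file opened with `open(path, newline="")`, the same check would be expected to report 119,390 rows and 32 columns if the figures quoted above hold.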
Knowing that this dataset comes from hotels, we can make sense of most of the variables. However, not all of them are familiar to someone without a background in hotel management. We will run through some of the industry jargon and the variables’ meanings before we take a closer look at the data.
Variable notes:
- is_canceled: (1) if the booking was cancelled and (0) if not.
- lead_time: number of days between when the booking was entered into the hotel’s booking system and the arrival date.
- meal: type of meal booked, which can be:
  - BB: Bed and Breakfast.
  - FB: Full board (breakfast, lunch and dinner).
  - HB: Half board (breakfast and one other meal, usually dinner).
  - Undefined/SC: no meal package.
- country: guests’ country of origin.
- market_segment: guests’ market segment, some of which may be associated with a booking channel.
  - Direct: guests who book directly with the hotel, whether through the hotel’s website, by phone or as walk-ins.
  - Corporate: guests whose bookings are made by a corporation/company, or guests who are business travellers.
  - Online TA: online travel agents, i.e. bookings made through third-party websites. Examples are Agoda, Expedia, Booking.com…
  - Offline TA/TO: bookings made by travel agents or tour operators.
  - Complementary: free stays offered to guests, usually as part of the hotel’s promotional programs.
  - Groups: guests who travelled in groups.
  - Undefined: undefined type of guests.
  - Aviation: we are not entirely sure, but this could be airline crews.
- distribution_channel:
  - Direct: bookings made directly with the hotel (hotel’s website, phone or walk-ins).
  - Corporate: bookings made by a corporation/company.
  - TA/TO: travel agents/tour operators.
  - Undefined: undefined distribution channel.
  - GDS: Global Distribution System. A GDS serves as a hub that connects companies in the travel industry (airlines, hotels, car rental…) with travel agents. Hotels place some of their inventory (rooms) in the GDS, and travel agents can then sell those rooms to their customers. Well-known GDSs include Amadeus, Sabre and Galileo.
- is_repeated_guest: (1) if the guest is a repeat guest and (0) if not.
- customer_type:
  - Transient: individuals or groups that occupy fewer than 10 rooms per night. These guests usually stay for a short time and require few services.
  - Contract: bookings bound by contracts, usually for more than 30 days and for a consistent block of rooms.
  - Transient-Party: a transient booking that is associated with other transient bookings.
  - Group: bookings associated with a group, usually occupying more than 10 rooms per night.
- previous_cancellations: number of bookings the customer cancelled prior to the current booking.
- previous_bookings_not_canceled: number of bookings the customer made and did not cancel prior to the current booking.
- booking_changes: number of changes made to the booking from when it was entered into the system until the day of arrival or cancellation.
- agent: ID of the travel agency that made the booking.
- company: ID of the company that made the booking.
- days_in_waiting_list: number of days the booking was on the waiting list before it was confirmed to the customer.
- adr: Average Daily Rate, computed as total room revenue (excluding breakfast, tax and service charges) divided by the total number of room nights sold.
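Two of the definitions above are simple arithmetic and can be illustrated with a short Python sketch. The function names and the numbers are hypothetical and chosen only for illustration; they are not taken from the dataset or from the hotels’ systems.

```python
from datetime import date


def lead_time(booked_on, arrival):
    """Days between entering the booking into the system and the arrival date."""
    return (arrival - booked_on).days


def average_daily_rate(room_revenue, room_nights):
    """Total room revenue divided by the number of room nights sold."""
    if room_nights <= 0:
        raise ValueError("room_nights must be positive")
    return room_revenue / room_nights


# Illustrative values only.
print(lead_time(date(2016, 3, 1), date(2016, 7, 1)))  # 122
print(average_daily_rate(450.0, 3))                   # 150.0
```

Here a booking entered on 1 March 2016 for arrival on 1 July 2016 has a lead time of 122 days, and 450 euros of room revenue over three room nights gives an ADR of 150 euros.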
Note that some variables (e.g. agent or company) contain “NULL” values. Here “NULL” does not mean that the value is missing; rather, the value never existed in the first place. For example, a booking may have no agent or company ID associated with it because it was made by an individual customer.
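When reading the file, this distinction can be made explicit by mapping the literal text “NULL” to a “not applicable” marker rather than treating it as missing data. The Python sketch below shows one way to do that; the helper name `parse_optional_id` and the two sample rows are hypothetical.

```python
def parse_optional_id(raw):
    """Map the literal string "NULL" to None (not applicable); keep other IDs as text."""
    value = raw.strip()
    return None if value == "NULL" else value


# Illustrative rows, not taken from the dataset.
rows = [
    {"agent": "9", "company": "NULL"},     # booked through an agent, no company
    {"agent": "NULL", "company": "NULL"},  # booked by an individual customer
]
cleaned = [{key: parse_optional_id(val) for key, val in row.items()} for row in rows]
print(cleaned)
# [{'agent': '9', 'company': None}, {'agent': None, 'company': None}]
```

Keeping the IDs as text avoids treating them as numbers that could be averaged or summed by mistake.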
Limitations
- The dataset covers only two specific hotels in Portugal. This narrow scope made it challenging to explain the trends we observed, because the reasons could be hotel-specific and could not always be covered by general industry knowledge.
- Some information about the hotels was not provided, for example the number of rooms, the occupancy rate, the location (in a busy district or in the suburbs), years of operation, or special events that might have occurred. This lack of information may leave some of our questions unanswered.
- Even though we were told the collection method, we were not able to verify the validity and correctness of the data. During our analysis we noticed some entries that did not look sensible and were very likely due to input errors, but we were unable to confirm this.
- Since the observations begin in July 2015 and end in August 2017, only 2016 is covered in full. The dataset covers only six months of 2015 and eight months of 2017. Hence, we do not analyze the data year by year, because the years could not be compared on a like-for-like basis.
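The coverage figures in the last point follow directly from the start and end months of the observation period. As a small check, the Python sketch below counts the calendar months per year in an inclusive month range; the function name `months_covered` is hypothetical.

```python
from collections import Counter


def months_covered(start_year, start_month, end_year, end_month):
    """Count calendar months per year in an inclusive (year, month) range."""
    counts = Counter()
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        counts[year] += 1
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return dict(counts)


print(months_covered(2015, 7, 2017, 8))  # {2015: 6, 2016: 12, 2017: 8}
```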
Collection methods
Antonio, Almeida, and Nunes (2019) collected the data by extracting the variables from the hotels’ PMS (Property Management System) database servers with a TSQL query in SQL Server Studio Manager. The variables were extracted from the following tables:
- BO (booking table in which the key, which is the ID, was retrieved).
- BL (booking change log; if the booking details had changed relative to the day before arrival, the value used was the one present in this table).
- ML (meals).
- DC (distribution channel).
- TR (transaction).
- CP (customer profiles).
- NT (nationalities).
- MS (market segments).
The diagram below, made by Antonio, Almeida, and Nunes (2019), presents the structure of the PMS databases:
Data Cleaning
Checking for missing values
Bennett (2001) argued that it is important to take missing values into account; otherwise the statistical analysis will be misleading and the variability of the data cannot be estimated correctly. Thus, before analyzing the data, we first checked for missing values using the visdat package (Tierney 2017).
We found missing values in the children variable, imputed them with the average number of children (mean imputation) (Kang 2013), and stored the result in a new variable called imputed_children.
We added this newly created column to the original dataset.
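Mean imputation replaces each missing entry with the average of the observed entries. The Python sketch below illustrates the idea on made-up values; it is not the code used for this analysis, and the helper name `impute_mean` is hypothetical.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


# Illustrative values only: one booking has no recorded number of children.
children = [0, 2, None, 1, 1]
imputed_children = impute_mean(children)
print(imputed_children)  # [0, 2, 1.0, 1, 1]
```

Note that the imputed value is an average, so in general it need not be a whole number even though the variable counts children.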
Figure 4.2 shows that there are no other missing values in the dataset. This is in line with the statement by Antonio, Almeida, and Nunes (2019) that there are no missing values in the database tables. However, we must note that some “NULL” values are present, and these should be interpreted as “not applicable” rather than as missing (Antonio, Almeida, and Nunes 2019). For example, if the company value is NULL, the booking was not made by a company.
Analysis and Findings
Tourism and Seasonality in Portugal
The tourism industry worries a lot about seasonality, because it affects the flow of visitors to tourist destinations. The hotel year is divided into two main seasons: high and low. As the names suggest, high season is the busy period when the weather is good and guest inflows are high; low season is the opposite.
Portugal is no exception. The high season in Portugal usually runs in summer (June to September) and in spring (January to March); the beaches are usually busiest in July and August. The low season generally falls in winter, beginning around November and ending at the end of February. Weather during this time can bring unexpected rain and a strong, cold breeze, which is not ideal for sightseeing (lisbonlisboaportugal.com, n.d.).
With these seasons in mind, we are interested in any potential effect of seasonality on the guest count and the ADR of our two hotels.
Which months across the two years saw the highest inflow of tourists?
The inflow of visitors to these two hotels will help us decide which months are the best time to travel to Portugal, and which hotel is the place to stay when you do.
Months are shown in order of occurrence (July is the first month of each year) to maintain the chronological order of the dataset. The overall pattern is nearly the same for both hotels and both years. The seasonality pattern resembles a “W” shape, with the low points of the W in the winter months from November to January and the high points in the spring and summer periods. Interestingly, both hotels had their highest number of guests in spring in Year 1 (March and May), while in Year 2 the highest numbers were recorded in summer and autumn (August and October).
We can therefore infer that most months are a good time to visit Portugal, particularly July to October and January to March. The graph above also shows that most guests prefer to stay in the City Hotel rather than the Resort Hotel.
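The July-first ordering of months used above can be sketched in a few lines of Python. This is an illustration only: it assumes the arrival month is available as an English month name, and the names `JULY_FIRST` and `monthly_counts` are hypothetical.

```python
from collections import Counter

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
# Rotate the calendar so that each observation year runs from July to June.
JULY_FIRST = MONTHS[6:] + MONTHS[:6]


def monthly_counts(arrival_months):
    """Count bookings per arrival month, reported in July-first order."""
    counts = Counter(arrival_months)
    return [(month, counts.get(month, 0)) for month in JULY_FIRST]


# Illustrative arrivals only.
sample = ["August", "July", "August", "January"]
print(monthly_counts(sample)[:3])  # [('July', 1), ('August', 2), ('September', 0)]
```

Months with no bookings are kept with a count of zero, so a plot built from this list keeps all twelve positions on the axis.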
Which segment of the hotel market is more profitable and lets customers book their trip easily?
Studying which booking channel works best would help customers choose how to book their stay at the hotel.
OTAs (online travel agents) have been the major player among the market segments of these hotels. In the city hotel, it took OTA bookings only half a year to overtake group reservations as the leading segment. In the first semester of 2016, the proportion of OTA bookings was double that of the previous semester.
At the resort hotel, by contrast, OTA bookings dominated from the beginning of the observed period. Their share decreased marginally in the first semester of 2016, when group bookings increased or customers started booking on their own. Both hotels reached their peak proportion of OTA bookings in the second semester of 2016.
Aside from OTAs, Figure 5.4 shows another interesting fact: even while OTA bookings dominated the market segments, the proportion of direct bookings at the resort hotel remained relatively stable. One possible reason is that this resort has its own way of promoting direct bookings, for example through loyalty vouchers.
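The proportions discussed in this section are shares of bookings per market segment within each semester. The Python sketch below shows how such shares can be computed; it assumes semester 1 means January to June and semester 2 means July to December, and the function names and sample bookings are hypothetical.

```python
from collections import Counter, defaultdict


def semester(month_number):
    """Semester 1 covers January to June, semester 2 covers July to December."""
    return 1 if month_number <= 6 else 2


def segment_shares(bookings):
    """bookings: iterable of (year, month_number, market_segment) tuples."""
    counts = defaultdict(Counter)
    for year, month, segment in bookings:
        counts[(year, semester(month))][segment] += 1
    shares = {}
    for period, segment_counts in counts.items():
        total = sum(segment_counts.values())
        shares[period] = {seg: n / total for seg, n in segment_counts.items()}
    return shares


# Illustrative bookings only, not taken from the dataset.
sample = [
    (2015, 9, "Groups"), (2015, 10, "Online TA"),
    (2016, 2, "Online TA"), (2016, 3, "Online TA"),
    (2016, 4, "Direct"), (2016, 5, "Groups"),
]
print(segment_shares(sample)[(2016, 1)]["Online TA"])  # 0.5
```

Running the same calculation separately for each hotel would give the per-hotel proportions that the comparison relies on.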
Where Did The Bookings Come From?
Another way to obtain more business information is to look at the origins of the travellers who book hotel rooms. Understanding their behaviour and preferences is essential, so that hoteliers can establish strategies for attracting them.
We can examine which parts of the world are most drawn to Portugal.
Figure 5.5 provides a map of bookings by country of origin, giving a view of the booking distribution across the globe. The hover option shows the count of guests from each country. Hovering over the map tells us that Portugal sees more tourists from Europe than from the other continents.
As this could be a topic of interest for many readers, who may want to know the count of guests from each country, the interactive table below gives a detailed view. The table suggests that the bookings came from 177 different countries.
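Counting bookings per country of origin, and the number of distinct countries, is a simple tally. The Python sketch below illustrates it on made-up country codes; it assumes that a literal “NULL” in the country variable marks an unknown origin and should be left out, and the function name is hypothetical.

```python
from collections import Counter


def bookings_by_country(country_codes):
    """Tally bookings per country code, most frequent first, ignoring "NULL"."""
    counts = Counter(code for code in country_codes if code != "NULL")
    return counts.most_common()


# Illustrative codes only.
codes = ["PRT", "GBR", "PRT", "FRA", "NULL", "GBR", "PRT"]
ranking = bookings_by_country(codes)
print(ranking)       # [('PRT', 3), ('GBR', 2), ('FRA', 1)]
print(len(ranking))  # 3 distinct countries
```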
Let’s see the effect of ADR on the two types of hotels with different types of guests.
Let’s look at descriptive statistics on how the average daily rate varies across the categories of guests in each hotel. Have foreign guests been helping to increase the hotels’ profit? Let us find out.
Our analysis shows that most bookings in both hotels come from international travellers, and the median ADR presented in Figure 5.6 shows that foreign guests in both types of hotels paid more than locals. We might therefore argue that these travellers have become the hotels’ valued customers. The hover option gives more detail on the summary statistics, including the 25th and 75th percentile values of ADR in euros. This is a useful insight when we have a spending limit for a trip: we can keep these percentiles as a reference when evaluating our planned expenditure.
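The summary statistics mentioned here (median, 25th and 75th percentiles of ADR for local and foreign guests) can be sketched in Python as follows. This is an illustration on made-up rates, assuming that a guest counts as local when the country code is PRT; the function name `adr_summary` is hypothetical, and the quartile method may differ from the one used for the figure.

```python
from statistics import median, quantiles


def adr_summary(bookings, home_country="PRT"):
    """bookings: iterable of (country_code, adr). Summarise ADR for locals and foreigners."""
    groups = {"local": [], "foreign": []}
    for country, adr in bookings:
        key = "local" if country == home_country else "foreign"
        groups[key].append(adr)
    summary = {}
    for key, rates in groups.items():
        if len(rates) < 2:  # quartiles need at least two observations
            continue
        p25, _, p75 = quantiles(rates, n=4, method="inclusive")
        summary[key] = {"p25": p25, "median": median(rates), "p75": p75}
    return summary


# Illustrative rates in euros, not taken from the dataset.
sample = [
    ("PRT", 60), ("PRT", 80), ("PRT", 100), ("PRT", 120), ("PRT", 140),
    ("GBR", 70), ("FRA", 90), ("ESP", 110), ("DEU", 130), ("GBR", 150),
]
print(adr_summary(sample))
```

For this sample the local group has a median of 100 with quartiles 80 and 120, and the foreign group a median of 110 with quartiles 90 and 130.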
Since international travellers have become the hotels’ valued guests, it is easier to retain them as potential long-term customers through personalisation. According to Criton (2019), personalisation is the secret to winning the customer’s heart.
References
Antonio, Nuno, Ana de Almeida, and Luis Nunes. 2019. “Hotel Booking Demand Datasets.” Data in Brief 22: 41–49.
Arel-Bundock, Vincent, Nils Enevoldsen, and CJ Yetman. 2018. “Countrycode: An R Package to Convert Country Names and Country Codes.” Journal of Open Source Software 3 (28): 848. https://doi.org/10.21105/joss.00848.
Bennett, Derrick A. 2001. “How Can I Deal with Missing Data in My Study?” Australian and New Zealand Journal of Public Health 25 (5): 464–69.
Jedina, Mohd Haniff, and Kohila Ranjinib. 2017. “Exploring the Key Factors of Hotel Online Booking Through Online Travel Agency.” In 4th International Conference on E-Commerce (Icoec) 2017 Held in Malaysia.
Kang, Hyun. 2013. “The Prevention and Handling of the Missing Data.” Korean Journal of Anesthesiology 64 (5): 402.
R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Talwar, Shalini, Amandeep Dhir, Puneet Kaur, and Matti Mäntymäki. 2020. “Why Do People Purchase from Online Travel Agencies (OTAs)? A Consumption Values Perspective.” International Journal of Hospitality Management 88.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.